Subword-Based Text Retrieval
Abstract
Document retrieval in languages with a rich and complex morphology – particularly in terms of derivation and (single-word) composition – suffers from serious performance degradation under the stemming-only paradigm of matching query terms to text words. We propose an alternative approach in which morphologically complex word forms are segmented into relevant subwords (such as stems, prefixes, and suffixes), and these subwords constitute the basic unit for indexing and retrieval. We evaluate our approach on a large biomedical document collection.
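The idea of indexing subwords instead of whole words can be illustrated with a minimal sketch. It is not the authors' method, since the abstract does not say how segmentation is done. The sketch assumes a greedy longest-match segmenter over a tiny hand-made lexicon; the names `SUBWORDS`, `DOCS`, `segment`, `build_index`, and `search` are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical toy subword lexicon (an assumption, not taken from the paper).
SUBWORDS = {"gastr", "enter", "itis", "o", "hepat", "nephr"}

# Hypothetical toy document collection.
DOCS = {
    1: "Acute gastroenteritis in children",
    2: "Chronic hepatitis treatment",
    3: "Gastritis and nephritis",
}


def segment(word, lexicon=SUBWORDS):
    """Greedy longest-match segmentation into lexicon subwords.

    Falls back to the whole word when the lexicon cannot cover it.
    """
    word = word.lower()
    parts, i = [], 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in lexicon:
                parts.append(word[i:j])
                i = j
                break
        else:
            return [word]  # no subword matches here: keep the word intact
    return parts


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Build an inverted index mapping each subword to a set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            for sub in segment(word):
                index[sub].add(doc_id)
    return index


def search(query, index):
    """Return ids of documents that contain every subword of the query."""
    subs = [s for w in tokenize(query) for s in segment(w)]
    if not subs:
        return set()
    result = set(index.get(subs[0], set()))
    for s in subs[1:]:
        result &= index.get(s, set())
    return result


INDEX = build_index(DOCS)
print(segment("gastroenteritis"))
print(search("enteritis", INDEX))
```

In this toy collection the query "enteritis" retrieves document 1, although the word itself never occurs there as a separate token; matching on whole words would miss it. A realistic system would need a curated subword lexicon and a ranking function, neither of which is shown here.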
Similar resources
Subword-based approaches for spoken document retrieval
This paper explores approaches to the problem of spoken document retrieval (SDR), which is the task of automatically indexing and then retrieving relevant items from a large collection of recorded speech messages in response to a user specified natural language text query. We investigate the use of subword unit representations for SDR as an alternative to words generated by either keyword spott...
An Investigation of Subword Unit Representations for Spoken Document Retrieval
This study investigates the feasibility of using subword unit representations for spoken document retrieval as an alternative to using words generated by either keyword spotting or word recognition. Our investigation is motivated by the observation that word-based retrieval approaches face the problem of either having to know the keywords to search for a priori, or requiring a very large recogn...
Open-vocabulary spoken document retrieval based on new subword models and subword phonetic similarity
A new type of video retrieval system is proposed that identifies a target video section by searching for a word passage submitted as a quoted speech or text query. The proposed system has two unique characteristics. The first characteristic is that it is based on subword models such as phonemes, syllables, and morphemes so the system is able to deal with any type of query, including new words a...
Multi-Scale Spoken Document Retrieval for Cantonese Broadcast News
This paper presents the application of a multi-scale paradigm to Chinese spoken document retrieval (SDR) for improving retrieval performance. Multi-scale refers to the use of both words and subwords for retrieval. Words are basic units in a language that carry lexical meaning and subword units (such as phonemes, syllables or characters) are building components for words. Retrieval using subword...
Multi-scale-audio indexing for translingual spoken document retrieval
MEI (Mandarin-English Information) is an English-Chinese crosslingual spoken document retrieval (CL-SDR) system developed during the Johns Hopkins University Summer Workshop 2000. We integrate speech recognition, machine translation, and information retrieval technologies to perform CL-SDR. MEI advocates a multi-scale paradigm, where both Chinese words and subwords (characters and syllables) ar...
Image retrieval using the combination of text-based and content-based algorithms
Image retrieval is an important research field which has received great attention in the last decades. In this paper, we present an approach to image retrieval based on the combination of text-based and content-based features. Keywords serve as the text-based features, and color and texture serve as the content-based features. Query in this system contains some keywords and an input...